Abstract

In this work we focus on the problem of image caption generation. We propose an extension of the long short-term memory (LSTM) model, which we call gLSTM for short. In particular, we add semantic information extracted from the image as an extra input to each unit of the LSTM block, with the aim of guiding the model towards solutions that are more tightly coupled to the image content. Additionally, we explore different length normalization strategies for beam search in order to avoid favoring short sentences. On various benchmark datasets such as Flickr8K, Flickr30K and MS COCO, we obtain results that are on par with or even outperform the current state of the art.



LSTM: a man in a blue shirt is sitting on a bench

gLSTM: a man in a red shirt stands on a bull

Semantic information (retrieval result): a man riding a bucking bull at a rodeo, a man is being thrown off a bull during a rodeo, bullrider at a rodeo riding a bull, four men in a rodeo with a bull bucking, three people wrestle with a bull at a rodeo, men on horses in the rodeo try to rope in a bull, a man wearing blue pants is riding a bucking bull, the cowboy prepares to lasso the bull

Semantic information (semantic embedding)



1. Introduction 

Recent successes in visual classification have shifted the interest of the community towards higher-level, more complicated tasks, such as image caption generation [38] [20] [28] [22] [23] [27] [26] [36] [37]. Although describing a picture is natural for a human, it is quite difficult for a computer to imitate this task. It requires the computer to have some level of semantic understanding of the content of an image, including which kinds of objects there are, what they look like, what they are doing, and so on. Last but not least, this semantic understanding has to be structured into a human-like sentence.

Inspired by recent advances in machine translation [3], neural machine translation models have lately been applied to the image caption generation task [26] [36], with remarkable success. In particular, compared to template-based methods, which use a rigid sentence structure, and transfer-based methods, which re-use descriptions available in the training data, methods based on neural machine translation models stand out thanks to their capability to generate new sentences. They manage to generalize effectively beyond the sentences seen at training time, which is possible thanks to the learnt language model. Most neural machine translation models follow an encoder-decoder pipeline [3] [32], where the sentence in the source language is first encoded into a fixed-length embedding vector, which is then decoded to generate a new sentence in the target language. For machine translation, parallel corpora are typically used for learning and evaluating the model. The pairs of sentences in the source and target languages usually share similar sentence structures (often including regular phrases and the same order of words). This structural information is encoded in the fixed-length embedding vector and is helpful to the translation.

*This work was carried out while he was in KU Leuven ESAT-PSI.

Figure 1: Image caption generation using LSTM and the proposed gLSTM. The captions generated by LSTM and gLSTM and the cross-modal retrieval result that is used as guidance are marked in green, red and blue, respectively.

Applied to caption generation, the aim is to “translate” an image into a sentence describing it. However, it is questionable whether these models can cope with the large differences between the two modalities. The structure of the visual information is very different from the structure of the description to be generated. During the encoding phase, the algorithm compresses all visual information into an embedding vector. Yet this vector is unlikely to capture the same level of structural information needed for correctly generating the textual description in the subsequent decoding phase.

One of the latest state-of-the-art methods uses a convolutional neural network (CNN) for the encoding step and a long short-term memory (LSTM) network for the decoding step. We notice that the generated sentence sometimes seems to “drift away” or “lose track” of the original image content, producing a description that is common in the dataset yet only weakly coupled to the input image. We hypothesize that this is because the decoding step needs to find a balance between two, sometimes contradicting, forces: on the one hand, the sentence to be generated needs to describe the image content; on the other hand, the generated sentence needs to fit the language model, with more likely word combinations being preferred. The system may then “lose track” of the original image content if the latter force starts to dominate. From an image caption generation point of view, however, staying close to the image content may be considered the more important of the two.

To overcome the limitation of the basic encoding-decoding pipeline, extended pipelines have been proposed in the context of both machine translation and image caption generation. They introduce an attention mechanism to align the information in the source and target domains, so that the model is able to attend to the most relevant part of the source-language sentence or of the image.

Here, we propose an alternative extension of the LSTM model that works at a more global scale. We start by extracting semantic information from the image and then use it to “guide” the decoder, keeping it “on track” by adding a positive bias to words that are semantically linked to the image content. More specifically, we add semantic information as an extra input to the gates of each LSTM memory cell. This extra input can take many different forms as long as it builds a semantic connection between the image and its description, e.g. a semantic embedding, a classification result or a retrieval result. As an illustration, we experiment with features obtained either from a multimodal semantic embedding using CCA or from the cross-modal retrieval results based on that semantic embedding.

Our contributions are twofold. As our main contribution, we present an extension of LSTM that is guided by semantic information of the image. We call the proposed method gLSTM. We show experimentally on multiple datasets that such guiding is beneficial for learning to generate image captions. As an additional contribution, we observe that current inference methodologies for caption generation are heavily biased towards short sentences. We show experimentally that this hurts the quality of the generated sentences and therefore propose sentence length normalization, which further improves the results. In the experiments, we show that the proposed method is on par with or even outperforms the latest published and unpublished state of the art on the popular datasets.

2. Related Work 

Caption generation. The literature on caption generation can be divided into three families. In the first family we have template-based methods [38] [20] [28]. These approaches first detect objects, actions, scenes and attributes, then fill them into a fixed sentence template, e.g. a subject-verb-object template. These methods are intuitive and can work with out-of-the-box visual classification components. However, they require explicit annotations for each class. Given the typically small number of categories available, these methods do not generate rich enough captions. Moreover, as they use rigid templates, the generated sentences are less natural.

The second family follows a transfer-based caption generation strategy [22] [23] [27] and is related to image retrieval. These methods first retrieve visually similar images, then transfer the captions of those images to the query image. The advantage of these methods is that the generated captions are more human-like than those produced by template-based methods. However, because they rely directly on retrieval results among the training data, there is little flexibility for them to add or remove words based on the content of an image.

Inspired by the success of neural networks in machine translation, researchers have recently proposed to use neural language models for caption generation. Instead of translating a sentence from a source language into a target one, the goal is to translate an image into a sentence that describes it. A multimodal log-bilinear neural language model has been proposed to model the probability distribution of a word conditioned on an image and previous words. Similarly, Mao et al. and Karpathy et al. have proposed to use a multimodal recurrent neural network model for caption generation. Vinyals et al. and Donahue et al. have proposed to use LSTM, an advanced recurrent neural network, for the same task. Very recently, Xu et al. have proposed to integrate visual attention into the LSTM model in order to fix its gaze on different objects during the generation of the corresponding words. Neural language models have shown great promise in generating human-like image captions. Most of these methods follow a similar encoding-decoding framework, except for the very recent method of Xu et al., which jointly learns visual attention and caption generation. However, that method requires location sampling during both training and testing, which makes it more complicated. While it focuses on local information, our method instead exploits global cues.

Overview. Our work belongs to the third family of caption generation methods, which use a neural language model to generate captions. Different from the above methods, however, we propose to use semantic information to guide the generation, and we propose an extension of the LSTM model, called gLSTM, that makes use of this semantic information. The semantic information here denotes the correlation between an image and its description, which is obtained in a similar manner as in transfer-based methods. Experiments illustrate that semantic information brings a significant improvement in performance and that our model outperforms recently proposed state-of-the-art methods. Interestingly, the proposed model performs on par with the latest, unpublished state-of-the-art method of Xu et al., despite its use of a more complicated model that requires location sampling during the training and test stages.

3. Background 
3.1. The LSTM Model 

A Recurrent Neural Network (RNN) is a good choice for modeling temporal dynamics in sequences. However, it is difficult for a traditional RNN to learn long-term dynamics because of the issue of vanishing and exploding gradients. The Long Short-Term Memory (LSTM) network was proposed to address these issues. The core of the LSTM architecture is the memory cell, which stores the state over time, and the gates, which control when and how to update the cell’s state. There are many variants with different connections between the memory cell and the gates.

The LSTM block that our model is built on follows the LSTM with No Peepholes architecture, which is illustrated in Figure 2 in black. The memory cell and gates in an LSTM block are defined as follows:

$i_l = \sigma(W_{ix} x_l + W_{im} m_{l-1})$  (1)
$f_l = \sigma(W_{fx} x_l + W_{fm} m_{l-1})$  (2)
$o_l = \sigma(W_{ox} x_l + W_{om} m_{l-1})$  (3)
$c_l = f_l \odot c_{l-1} + i_l \odot h(W_{cx} x_l + W_{cm} m_{l-1})$  (4)
$m_l = o_l \odot c_l$  (5)

where $\odot$ represents element-wise multiplication, $\sigma(\cdot)$ represents the sigmoid function and $h(\cdot)$ represents the hyperbolic tangent function. The variable $i_l$ stands for the input gate, $f_l$ for the forget gate and $o_l$ for the output gate of the LSTM cell. $c_l$ is the state of the memory cell unit and $m_l$ is the hidden state, that is, the output of the block generated by the cell. The variable $x_l$ is the element of the sequence at timestep $l$ and the matrices $W$ denote the parameters of the model.



Figure 2: The LSTM block in black, the proposed gLSTM network in black and red. Striped lines stand for external connections. By considering semantic information as an extra input, we encourage the network to refresh its memory following a global guide.

3.2. Caption Generation with LSTM 

The pipeline for caption generation with the RNN model [26] [7] [37] is inspired by the encoder-decoder principle in neural machine translation. An encoder is used to map a variable-length sequence in the source language into a distributed vector, and a decoder is used to generate a new sequence in the target language conditioned on this vector. During training, the goal is to maximize the log-likelihood of the correct translation given the sentence in the source language. When applying this principle to caption generation, the goal becomes to maximize the log-likelihood of the image caption given an image, namely

$\arg\max_{\theta} \sum_{i} \log p(s^i_{1:L^i} \mid x^i, \theta)$,  (6)

where $x^i$ denotes an image, $s^i_{1:L^i}$ denotes a sequence of words in a sentence of length $L^i$ and $\theta$ denotes the model parameters. For simplicity, in the following we drop the superscript $i$ whenever it is clear from the context. Since each sentence is composed of a sequence of words, it is natural to use the Bayes chain rule to decompose the likelihood of a sentence,

$\log p(s_{1:L} \mid x, \theta) = \log p(s_1 \mid x, \theta) + \sum_{l=2}^{L} \log p(s_l \mid x, s_{1:l-1}, \theta)$,  (7)

where $s_{1:l-1}$ stands for the part of the sentence up to the $(l-1)$-th word. To maximize the objective in eq. (6) over the whole training corpus, we need to define the log-likelihood $\log p(s_l \mid x, s_{1:l-1}, \theta)$, which can be modeled with the hidden state of a timestep in the RNN. The probability distribution of the word at timestep $l+1$ over the whole vocabulary is computed using the softmax function $z(\cdot)$ based only on the output $m_l$ of the memory cell, $p_{l+1} = z(m_l)$, as in previous work.
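
The chain-rule decomposition in eq. (7) can be illustrated with a minimal Python sketch. It assumes that some model has already produced one vector of unnormalized vocabulary scores per timestep; the names `softmax`, `sentence_log_likelihood`, `step_scores` and `word_ids` are hypothetical and not taken from the paper or its code.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def sentence_log_likelihood(step_scores, word_ids):
    """Sum of per-word log-probabilities, following the chain rule of eq. (7).

    step_scores[l] holds the unnormalized vocabulary scores available before
    word l is emitted (an assumed stand-in for the LSTM output mapped to
    vocabulary size); word_ids[l] is the index of the word actually observed.
    """
    total = 0.0
    for scores, word in zip(step_scores, word_ids):
        total += np.log(softmax(scores)[word])
    return total
```

Because every term is the logarithm of a probability below 1, the sum decreases as words are added, which is the effect discussed later for beam search.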

To feed images and sentences to the LSTM, they need to be encoded as fixed-length vectors. For the image, CNN features are first computed and then mapped to an embedding space via a linear transformation. For the sentence, each word is first represented as a one-hot vector and then mapped to the same embedding space via a word embedding matrix. Finally, an image and the sequence of words in a sentence are concatenated to form a new sequence, that is, the image is treated as the beginning symbol of the sequence and the sequence of words forms the remaining part of the new sequence. This sequence is fed to the LSTM network for training by iterating the recurrence connection for $l$ from 1 to $L^i$. The parameters of the model include the linear transformation matrix for image features, the word embedding matrix and the parameters of the LSTM.
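
As a rough illustration of how such an input sequence could be assembled, the following Python sketch maps a CNN feature and a list of word indices into one embedding space and stacks them. The function name, the argument names and the absence of a bias term are assumptions made for this example only.

```python
import numpy as np

def build_input_sequence(cnn_feature, word_ids, W_img, W_word):
    """Return a (1 + len(word_ids), d) array: image embedding, then words.

    W_img  : (d, feature_dim) linear map for CNN features (assumed, no bias).
    W_word : (vocab_size, d) word embedding matrix; selecting row k is the
             same as multiplying the one-hot vector of word k by the matrix.
    """
    image_embedding = W_img @ cnn_feature                     # shape (d,)
    word_embeddings = W_word[np.asarray(word_ids, dtype=int)] # shape (L, d)
    return np.vstack([image_embedding[None, :], word_embeddings])
```

The first row plays the role of the beginning symbol of the sequence, and the remaining rows are the embedded words in order.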

3.3. Normalized Canonical Correlation Analysis 

To build our semantic representation, we rely on normalized Canonical Correlation Analysis (CCA), which was proposed to address the cross-modal retrieval problem. CCA is a popular method used to map visual and textual features into a common semantic space. CCA aims at learning projection matrices $U_1$ and $U_2$ for two views $X_1$ and $X_2$ such that their projections are maximally correlated, namely

$\arg\max_{U_1, U_2} \frac{U_1^T \Sigma_{X_1 X_2} U_2}{\sqrt{U_1^T \Sigma_{X_1 X_1} U_1} \sqrt{U_2^T \Sigma_{X_2 X_2} U_2}}$,  (8)

where $\Sigma_{X_1 X_2}$, $\Sigma_{X_1 X_1}$ and $\Sigma_{X_2 X_2}$ are the covariance matrices. The CCA objective function can be solved via generalized eigenvalue decomposition. The normalized CCA is computed by using a power of the eigenvalues to weight the corresponding columns of the CCA projection matrices, followed by L2 normalization, that is,

$g_1 = \frac{X_1 U_1 D^p}{\|X_1 U_1 D^p\|_2}, \quad g_2 = \frac{X_2 U_2 D^p}{\|X_2 U_2 D^p\|_2}$,  (9)

where $D$ is a diagonal matrix whose elements are set to the eigenvalues of the corresponding dimensions, while $g_1$ and $g_2$ denote the semantic representations of the two views. Cosine similarity is used to find the nearest neighbor in the learned common semantic space.
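
The normalization and nearest-neighbor steps can be sketched in Python as follows. This is only an illustration under the assumption that the projection matrix `U` and the eigenvalues `eigvals` have already been obtained from a CCA solver; the function names are hypothetical, and solving the CCA problem itself is not shown.

```python
import numpy as np

def normalized_cca_embed(X, U, eigvals, p=4.0):
    """Project rows of X with U, weight columns by eigvals**p, L2-normalize rows."""
    projected = (X @ U) * (eigvals ** p)   # same as X @ U @ diag(eigvals**p)
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.maximum(norms, 1e-12)  # guard against zero rows

def cosine_nearest(query, candidates):
    """Index of the candidate row with the highest cosine similarity to query.

    Both inputs are assumed to be L2-normalized already, so the dot product
    equals the cosine similarity.
    """
    return int(np.argmax(candidates @ query))
```

Since every embedding has unit length after normalization, ranking by dot product and ranking by cosine similarity give the same result.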

4. The Proposed Methods 

In this section, we describe the proposed extension of the LSTM model for the caption generation task. In the new architecture, we add semantic information to the computation of the gates and cell state. The semantic information here is extracted from images and their descriptions, serving as a guide in the process of word sequence generation.


4.1. gLSTM 

The generation of a word in the LSTM model mainly depends on the word embedding at the current timestep and the previous hidden state (which includes the image information at the beginning). This process goes on step by step until it encounters the end token of a sentence. However, as this process continues, the role of the image information, which is only fed at the beginning, becomes weaker and weaker. Words generated at the beginning of a sequence suffer from the same problem. Therefore, for a long sentence, the model may carry out the generation almost “blindly” towards the end of the sentence. Though LSTM is able to keep long-term memory to some extent, this still poses a challenge for sentence generation. In the proposed model, the generation of words is carried out under the guidance of global semantic information. Our extension of the LSTM model is named gLSTM. The memory cell and gates in a gLSTM block are defined as follows:


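$i'_l = \sigma(W_{ix} x_l + W_{im} m'_{l-1} + W_{iq} g)$  (10)
$f'_l = \sigma(W_{fx} x_l + W_{fm} m'_{l-1} + W_{fq} g)$  (11)
$o'_l = \sigma(W_{ox} x_l + W_{om} m'_{l-1} + W_{oq} g)$  (12)
$c'_l = f'_l \odot c'_{l-1} + i'_l \odot h(W_{cx} x_l + W_{cm} m'_{l-1} + W_{cq} g)$  (13)
$m'_l = o'_l \odot c'_l$  (14)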


where $g$ denotes the vector representation of the semantic information. Compared to the standard LSTM architecture, in gLSTM we add a new term to the computation of each gate and of the cell state. This new term represents the semantic information, which works as a bridge between the visual and textual domains. The semantic information $g$ does not depend on the timestep $l$, hence it works as a global guide during caption generation. The guidance term could also be timestep-dependent, at the expense of a more complex model. The additions of the gLSTM network architecture are marked in red in Figure 2.
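
A single gLSTM step can be sketched in Python as below, following eqs. (10)-(14). The dictionary keys used for the weight matrices simply mirror the subscripts in the equations and are a naming assumption for this example; bias terms are omitted, as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glstm_step(x, m_prev, c_prev, g, W):
    """One gLSTM step following eqs. (10)-(14).

    W is a dict of weight matrices keyed 'ix', 'im', 'iq', 'fx', ..., 'cq'
    (hypothetical naming that mirrors the subscripts in the equations).
    Returns the new hidden state m and the new cell state c.
    """
    i = sigmoid(W['ix'] @ x + W['im'] @ m_prev + W['iq'] @ g)   # input gate
    f = sigmoid(W['fx'] @ x + W['fm'] @ m_prev + W['fq'] @ g)   # forget gate
    o = sigmoid(W['ox'] @ x + W['om'] @ m_prev + W['oq'] @ g)   # output gate
    c = f * c_prev + i * np.tanh(W['cx'] @ x + W['cm'] @ m_prev + W['cq'] @ g)
    m = o * c                                                    # eq. (14)
    return m, c
```

Setting all `'iq'`, `'fq'`, `'oq'` and `'cq'` matrices to zero removes the guidance term, so the same function then behaves like the unguided LSTM block.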

4.2. Semantic Information

In this section, we detail several kinds of semantic information that can be used as guidance in our model. Intuitively, there are three ways to extract the semantic information. First, we can treat it as a cross-modal retrieval task and simply use the retrieved sentences as semantic information. Alternatively, semantic information can be represented as the embedding in a semantic space where visual and textual representations are equivalent. The last option is to use the image itself as guidance.

Retrieval-based guidance (ret-gLSTM). The retrieval-based guidance is inspired by the transfer-based caption generation methods. Though the sentences generated by transfer-based methods may not be totally correct, they do have something in common with the true captions annotated by humans. Given an image, we first perform cross-modal retrieval, whose aim is to find texts relevant to the query image. Then we collect the top-ranked descriptions. Instead of generating a sentence by directly modifying these sentences, we treat these captions as auxiliary information and feed them to the neural language model proposed in the previous section. These sentences may not match the image perfectly. However, they provide rich semantic information about the image. Since these sentences are annotated by humans, the words in them are very natural and have a high probability of appearing in the reference captions.

The cross-modal retrieval method used here is based on the normalized CCA described in Section 3.3. In this paper, image and text features correspond to the two views for CCA. CNN features are computed for the images and TF-IDF weighted BoW features are computed for the sentences. We project both images and sentences from their own domain to the common semantic space. Given an image query, the closest sentences are then retrieved based on cosine similarity. We select the top T retrieved sentences from the training set (T = 15 in this paper). These sentences are represented by a bag-of-words (BoW) vector, which is fed as the extra input, i.e. the guide, to the gLSTM model.
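
One possible way to turn the top T retrieved sentences into a single guidance vector is sketched below in Python. The text only states that the retrieved sentences are represented by a BoW vector; summing the per-sentence BoW counts is an assumption made for this example, as are all function and argument names.

```python
import numpy as np

def retrieval_guidance(image_emb, sentence_embs, sentence_bows, top_t=15):
    """Sum the BoW vectors of the top_t sentences closest to the image.

    image_emb     : (k,) L2-normalized image embedding in the common space.
    sentence_embs : (n, k) L2-normalized training-sentence embeddings.
    sentence_bows : (n, vocab) bag-of-words counts per training sentence.
    Summation is an assumed aggregation, not confirmed by the paper.
    """
    similarity = sentence_embs @ image_emb   # cosine similarity per sentence
    top = np.argsort(-similarity)[:top_t]    # indices of the best matches
    return sentence_bows[top].sum(axis=0)
```

The returned vector has the size of the BoW vocabulary and would take the place of g in the gLSTM equations.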

Semantic embedding guidance (emb-gLSTM). We can explicitly use the result of cross-modal retrieval as guidance, as mentioned above. We can also implicitly use the intermediate result of cross-modal retrieval, that is, the semantic representation computed using normalized CCA, as the extra input. An image is mapped into the common semantic space by the learned projection matrix and the computed semantic embedding is fed to the gLSTM model as the guide. It is assumed that in the common semantic space of CCA both views share an equivalent embedding representation. Therefore, we can treat the representation projected from the image domain as equal to the one projected from the text domain. Compared to the ret-gLSTM model, the semantic representation has much lower dimensionality than the BoW representation and saves the computation of finding nearest neighbors. In addition, we find that it performs even better than the previous method.

Image as guidance (img-gLSTM). Finally, we experiment with the image itself as the extra input. This is motivated by the fact that CCA is a linear transformation. A natural question then is whether we can learn this projection matrix directly during the training of the gLSTM model. We test this experimentally by simply feeding the image feature itself to the gLSTM model, namely $g = x$, and letting the network learn the semantic information from scratch.


4.3. Beam Search with Length Normalization 

In the generation stage, with a vocabulary of size $K$, there are $K^l$ sentences of length $l$ as potential candidates for an image caption, where $l$ is unknown. Ideally, we want to find the sentence that maximizes the log-likelihood of eq. (7). Considering the exponential search space, however, exhaustive search is intractable. Therefore, a heuristic search strategy is employed instead.

Here we use beam search, which is a fast and effective decoding method for RNN-based models. At each iteration only the T hypotheses with the highest log-likelihood are kept in the beam pool. The search along one beam stops once it encounters an end-of-sequence token, which is generated given the previous words along the beam. The search continues until the search along all beams in the pool has stopped.
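
The procedure described above can be sketched in Python as follows. This is a generic beam search, not the authors' implementation; `step_log_probs` is a hypothetical callback standing in for the language model, and `max_len` is an added safety limit that the text does not mention.

```python
import math

def beam_search(step_log_probs, beam_size, end_token, max_len=20):
    """Return finished (tokens, log_likelihood) hypotheses.

    step_log_probs(prefix) is a hypothetical callback that returns a dict
    {token: log_prob} for the next word given the tuple of previous tokens.
    Hypotheses still unfinished after max_len steps are discarded.
    """
    beams = [((), 0.0)]          # live hypotheses: (prefix, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # keep only the beam_size best expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))   # this beam stops
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return finished
```

The function returns all finished hypotheses with their unnormalized log-likelihoods, so the final selection criterion can be chosen separately.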

It is problematic to directly use the log-likelihood of words as the criterion for selecting a generation. Since the log-likelihood of each single word is negative (because the probability is smaller than 1), summing the log-likelihoods of more words leads to a smaller value. Therefore, when the beam width is larger than 1, a beam that stops early is more likely to be selected as the final caption, regardless of the quality of each generated word in the beam. This means that this kind of beam search favors shorter sentences, which has also been observed in previous work.

Interestingly, the bias towards short sentences tends to favor the low-order BLEU scores (BLEU@1,2), commonly used to evaluate machine translation algorithms. Hence, short sentences not only tend to dominate the inference, but also obscure the evaluation and the comparison of methods. To remedy the bias towards short sentences during inference, we propose to normalize the log-likelihood of words by length, namely

$\frac{1}{\Omega(\ell)} \sum_{l=1}^{\ell} \log p(s_l \mid x, s_{1:l-1}, \theta)$.  (15)

We investigate various forms of $\Omega(\ell)$ for the normalization.

Polynomial normalization. A first possibility is to set $\Omega(\ell) = |\ell|^m$. Notice that when $m = 1$, eq. (15) becomes the definition of the perplexity. We use $m = 1$ in our paper. This kind of normalization punishes short sentences.

Min-hinge normalization. Intuitively, we want to automatically generate a sentence whose length is close to that of the ground truth. Since we do not know the length in advance at the test stage, we use the average length of the sentences in the training data as a reference. We define the min-hinge length function as $\Omega(\ell) = \min\{\ell, \mu\}$. This means a generated sentence is only punished when it is shorter than the average length $\mu$. For sentences that are long enough, we only pay attention to their log-likelihood.



Max-hinge normalization. Similarly, we define the max-hinge length function, $\Omega(\ell) = \max\{\ell, \mu\}$. Instead of penalizing short sentences, the max-hinge function favors long sentences.

Gaussian normalization. We can also employ a Gaussian function, $\Omega(\ell) \sim \mathcal{N}(\mu, \sigma)$, to normalize the log-likelihood, where $\mu$ and $\sigma$ are the mean and the standard deviation of the sentence lengths in the training corpus. The Gaussian normalization encourages the inference to select sentences that have similar lengths to the sentences in the training set.

We experimentally verify the effectiveness of these strategies in Section 5.1.
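
To make the effect of the divisor concrete, the following Python sketch re-ranks finished hypotheses by their length-normalized log-likelihood. It covers the polynomial, min-hinge and max-hinge forms only; the Gaussian variant is left out because its exact functional form is not spelled out here. All function names are hypothetical.

```python
def polynomial_norm(length, m=1.0):
    """Polynomial divisor: length raised to the power m."""
    return float(length) ** m

def min_hinge_norm(length, mean_length):
    """Min-hinge divisor: the length, capped at the mean training length."""
    return float(min(length, mean_length))

def max_hinge_norm(length, mean_length):
    """Max-hinge divisor: the length, but never below the mean training length."""
    return float(max(length, mean_length))

def select_caption(hypotheses, norm):
    """Pick the hypothesis with the highest length-normalized log-likelihood.

    hypotheses is a list of (tokens, summed_log_likelihood) pairs; norm maps
    a sentence length to the divisor used in eq. (15).
    """
    return max(hypotheses, key=lambda h: h[1] / norm(len(h[0])))
```

For example, with a two-token hypothesis scoring -2.0 and a four-token hypothesis scoring -3.0, the unnormalized criterion prefers the shorter one, whereas polynomial normalization with m = 1 compares -1.0 against -0.75 and prefers the longer one.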

5. Experiments 

Datasets and experimental setup. We perform experiments on the following datasets: Flickr8k, Flickr30k and MS COCO. The Flickr8k dataset is a popular dataset composed of 8,000 images in total collected from Flickr, divided into a training, validation and test set of 6,000, 1,000 and 1,000 images respectively. Each image in the dataset is accompanied by 5 reference captions annotated by humans. Similar to Flickr8k, the Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. However, it does not provide a split setting file, so we use the publicly available split used in previous work, that is, 29,000 images for training, 1,000 for validation and 1,000 for testing. The large-scale MS COCO dataset contains 82,783 images for training and 40,504 for validation, with each image associated with 5 captions. Note that we do not evaluate on the test set used for the MS COCO Image Captioning challenge but use the publicly available splits used in previous work, that is, all 82,783 images from the training set for training and 5,000 images from the validation set for validation and testing.

Evaluation measures. Here we use the two most popular measures in the machine translation and image caption generation literature, namely the BLEU [29] and the METEOR measure.

BLEU is a precision-based metric. The main component of BLEU is the n-gram precision of the generated caption with respect to the references. Precision is computed separately for each n-gram order and then B@n is computed as a geometric mean of these precisions. BLEU with high-order n-grams indirectly measures grammatical coherence.

However, BLEU is criticized for favoring short sentences. It only considers precision and does not take recall into consideration. For this reason METEOR is also reported in recent works [37]. METEOR evaluates a generated sentence by computing a score based on word-level matches between the generation and a reference and returning the maximum score over a set of references. In the computation of the matching score, it considers unigram precision, unigram recall and a measure of alignment. Hence, METEOR accounts for precision, recall and the importance of grammaticality. In user evaluation studies, METEOR [24] has been shown to have a higher correlation with human judgments than any order of BLEU.

All scores are computed using the coco-caption code¹.

Implementation details. In the following experiments we use the MatConvNet toolbox [35] and the pretrained 16-layer OxfordNet model to compute CNN features, and we extract the output of the last fully-connected layer as the image representation. As for text preprocessing, for the neural language model we use the publicly available data in which texts are converted to lowercase, non-alphanumeric characters are ignored and only words appearing at least 5 times in the training set are kept to create a vocabulary. For CCA, we use the NLTK toolbox [2] to further lemmatize words and build a vocabulary based on these words (3000 words for Flickr8k and 5000 for Flickr30k and MS COCO). Then TF-IDF weighted BoW vectors are computed as the sentence representation for CCA. For Flickr8k and Flickr30k we set the number of dimensions for the image and word embeddings and the hidden layer of the gLSTM to 256. For MS COCO we set the number to 512 (note that this is much smaller than the one used in other work). The gLSTM models are trained with RMSProp [34], a stochastic gradient descent method with an adaptive learning rate. The learning rate is initialized to 1e-4 for Flickr8k and Flickr30k and to 4e-4 for MS COCO. We use dropout and early stopping to avoid overfitting and use the validation set log-likelihood for model selection. For CCA, we set p = 4 as suggested in previous work and the dimension of the common space to 200 for Flickr8k and Flickr30k, and 500 for MS COCO, which we find works well in practice. At the test stage, we set the beam size to 10 for all experiments. We built our code for the proposed gLSTM model on Karpathy’s NeuralTalk code², which implements the single model in Google’s paper. Note that we take that model as the baseline.

5.1. Length Normalization 

In this experiment we evaluate the importance of sentence length normalization for caption generation. We carry out the experiment on the Flickr8k dataset and report the results in Table 1. For clarity we perform this experiment based on the LSTM baseline, not gLSTM.

We observe that compared to the baseline whose selec¬ 
tion is based on unnormalized log-likelihood, length nor¬ 
malization has a positive effect on either the BLEU metric 
or METEOR metric. Polynomial, min-hinge and Gaussian 
normalization respectively bring the largest improvement 

¹ https://github.com/tylin/coco-caption

² https://github.com/karpathy/neuraltalk





a young boy is running on the beach, a man in a blue shirt is riding a dirt bike, a little boy runs away from the approaching waves of the
ocean, a little girl runs across the wet beach, a little girl runs on the wet sand near the ocean, a young girl runs across a wet beach with
the ocean in the background, child running on the beach, two children are running towards the ocean on a beach, a dog is running in the 
ocean beside the beach, a dog playing in the ocean on the beach , a boy running through surf on a beach, boy running through the water 
at the beach, a girl runs down a beach, a boy standing on a beach, a man riding his bike on the beach by the ocean, a young girl running 
on the beach, a dog is running on the beach, a young child running along the shore at a beach, boy and girl running along the beach, a 
dog running on the beach, a dog running on the beach, a dog running on the beach 


a group of dogs are running on a track, a group of people racing on a track, a dog with a muzzle is leading several other dogs in a race,
a greyhound leaps in a race, a muzzled dog in a race with four dogs following, five dogs are racing, five dogs are racing on a dirt track, 
two greyhounds with muzzles race along the inside curb of a railed dirt track, the greyhound racing dogs are running around a bend in 
the track, three muzzled greyhounds race around a turn in a track, several muzzled greyhound dogs racing around a track, two muzzled 
greyhounds dogs racing around a track, two greyhounds race around a track, greyhounds racing chasing a mechanical rabbit around the 
track, three greyhounds are racing on a track at night, three greyhound dogs race around a dark track, muzzled greyhounds are racing 
along a dog track at night, three greyhounds racing around the corner of a track, greyhounds racing on a track, greyhounds race on a 
track, greyhounds race on a track, three greyhounds are in a dog race at the track 



a woman in a black shirt and sunglasses smiles, a man and a woman pose for a picture, a brunette girl wearing sunglasses and a 
yellow shirt, a girl in sunglasses smiles, a girl wearing a yellow shirt and sunglasses smiles, a girl wearing sunglasses smiles for the 
camera, a woman with a yellow shirt wears sunglasses and smiles, a woman wearing sunglasses smiles, young man with upturned 
hair posing with young man with sunglasses and woman with glasses, a blonde woman wearing sunglasses and dice earrings 
smiles, a woman wearing black sunglasses looks to the right and smiles, a smiling woman is wearing sunglasses on a day with 
sparse clouds, a smiling woman with long dark hair wearing sunglasses on top of her head, a man and woman wearing sunglasses 
and white t-shirts smile for the camera, a man in sunglasses smiles, a blonde lady with sunglasses smiles, women in hat and 
sunglasses smiles, a woman wearing sunglasses, man and woman wearing sunglasses posing for picture, woman with green 
sweater and sunglasses smiling, a woman in a sunhat is wearing sunglasses and laughing, a woman wearing sunglasses on her
head looking down 


Figure 3: Results of the LSTM and gLSTM models. Sentences generated by LSTM and gLSTM are marked in
green and red respectively, the ground-truth references in black, and the most relevant retrieval results in blue. We observe that the retrieval
results are helpful for caption generation. Note that for the third example, the result of our model is not entirely accurate but is
still much better than that of the LSTM model.


| Normalization | B@1  | B@2  | B@3  | B@4  | METEOR |
|---------------|------|------|------|------|--------|
| Baseline      | 59.6 | 40.4 | 26.1 | 17.0 | 17.45  |
| Polynomial    | 57.8 | 39.2 | 26.0 | 17.6 | 18.86  |
| Min-hinge     | 60.4 | 41.4 | 27.6 | 18.6 | 18.53  |
| Max-hinge     | 57.6 | 38.8 | 25.2 | 16.7 | 17.65  |
| Gaussian      | 60.7 | 41.7 | 27.8 | 18.6 | 18.35  |

Table 1: The performance of different length normalization
strategies on Flickr8k.


| GT Refs     | Baseline   | Polynom.    | Min-hinge  | Max-hinge  | Gaussian   |
|-------------|------------|-------------|------------|------------|------------|
| 10.87(3.74) | 8.75(2.44) | 11.07(2.62) | 9.64(1.92) | 9.55(1.69) | 9.57(3.30) |

Table 2: The average and the standard deviation (in parentheses)
of the sentence length for the ground truth references and for
different normalization strategies on Flickr8k.

to METEOR and BLEU. Therefore, in the following experiments,
we only report the performance of the proposed
gLSTM with these three kinds of length normalization.
We also compute the average length of the generated sentences
and of the references (Table 2).
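
As a rough illustration of how length normalization changes which beam-search candidate is selected, the sketch below re-ranks a toy beam. It is only a minimal sketch under stated assumptions: the polynomial normalizer is assumed here to divide the log-likelihood by the sentence length raised to a power `m`, and the function names, the toy beam and its scores are hypothetical, not taken from the experiments.

```python
import statistics

def rerank(candidates, normalizer=None):
    """Pick the best beam candidate.

    candidates: list of (tokens, log_likelihood) pairs.
    normalizer: maps a sentence length to a positive divisor;
                None reproduces the unnormalized baseline.
    """
    def score(cand):
        tokens, logp = cand
        if normalizer is None:
            return logp
        return logp / normalizer(len(tokens))
    return max(candidates, key=score)

def polynomial(m=1.0):
    # Assumed form: divide the log-likelihood by length ** m.
    return lambda length: float(length) ** m

def length_stats(sentences):
    """Mean and (population) standard deviation of sentence length."""
    lengths = [len(s) for s in sentences]
    return statistics.mean(lengths), statistics.pstdev(lengths)

# Hypothetical beam: a short candidate with a higher raw log-likelihood
# and a longer one with a lower raw log-likelihood.
beam = [
    ("a dog runs".split(), -4.5),
    ("a dog is running on the beach".split(), -7.0),
]

best_raw = rerank(beam)                    # unnormalized selection
best_norm = rerank(beam, polynomial(1.0))  # length-normalized selection

print(" ".join(best_raw[0]))
print(" ".join(best_norm[0]))
print(length_stats([best_raw[0], best_norm[0]]))
```

In this toy case the unnormalized score prefers the three-word sentence, while the normalized score (-1.5 versus -1.0 per word) prefers the longer one, which is the bias towards short sentences that normalization is meant to counter.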

5.2. gLSTM with Different Types of Guidance 

In this experiment we evaluate the gLSTM model with
different types of semantic information, as described in
Section 4.2. For a fair comparison, we also apply beam search
with length normalization to the baseline. We run this
experiment on Flickr8k and report the results in Table 3.



| Model                    | B@1  | B@2  | B@3  | B@4  | METEOR |
|--------------------------|------|------|------|------|--------|
| Baseline, Original       | 59.6 | 40.4 | 26.1 | 17.0 | 17.45  |
| Baseline, Polynomial     | 57.8 | 39.2 | 26.0 | 17.6 | 18.86  |
| Baseline, Min-hinge      | 60.4 | 41.4 | 27.6 | 18.6 | 18.53  |
| Baseline, Gaussian       | 60.7 | 41.7 | 27.8 | 18.6 | 18.35  |
| Baseline 512, Original   | 61.0 | 42.4 | 28.6 | 18.9 | 18.21  |
| Baseline 512, Polynomial | 58.2 | 40.2 | 27.1 | 18.1 | 19.83  |
| Baseline 512, Min-hinge  | 61.3 | 42.9 | 29.2 | 19.6 | 19.13  |
| Baseline 512, Gaussian   | 61.3 | 42.8 | 29.1 | 19.5 | 19.07  |
| ret-gLSTM, Original      | 63.4 | 43.7 | 29.2 | 19.3 | 18.54  |
| ret-gLSTM, Polynomial    | 58.8 | 40.4 | 27.5 | 18.6 | 19.86  |
| ret-gLSTM, Min-hinge     | 63.0 | 43.8 | 29.9 | 20.2 | 19.46  |
| ret-gLSTM, Gaussian      | 63.5 | 44.2 | 30.2 | 20.6 | 19.38  |
| emb-gLSTM, Original      | 63.7 | 44.7 | 30.2 | 20.2 | 19.10  |
| emb-gLSTM, Polynomial    | 61.0 | 43.0 | 29.6 | 20.1 | 20.60  |
| emb-gLSTM, Min-hinge     | 64.3 | 45.7 | 31.6 | 21.5 | 20.28  |
| emb-gLSTM, Gaussian      | 64.7 | 45.9 | 31.8 | 21.6 | 20.19  |
| img-gLSTM, Original      | 61.5 | 42.5 | 27.2 | 16.7 | 17.10  |
| img-gLSTM, Polynomial    | 55.7 | 38.1 | 24.9 | 15.8 | 17.69  |
| img-gLSTM, Min-hinge     | 60.4 | 41.9 | 27.6 | 17.7 | 17.76  |
| img-gLSTM, Gaussian      | 60.1 | 41.4 | 27.2 | 17.3 | 17.69  |

Table 3: Comparison between gLSTM with different semantic
information on Flickr8k. We denote the gLSTM
model with retrieval-based guidance as ret-gLSTM, the one
with semantic embedding guidance as emb-gLSTM, and the
one with image as guidance as img-gLSTM.

The results show that semantic information substantially
improves performance, especially for emb-



















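| Method                | Flickr8k B@1 | Flickr8k B@2 | Flickr8k B@3 | Flickr8k B@4 | Flickr8k METEOR | Flickr30k B@1 | Flickr30k B@2 | Flickr30k B@3 | Flickr30k B@4 | Flickr30k METEOR |
|-----------------------|------|------|------|------|-------|------|------|------|------|-------|
| LogBilinear [19]      | 65.6 | 42.4 | 27.7 | 17.7 | 17.31 | 60.0 | 38   | 25.4 | 17.1 | 16.88 |
| multimodal RNN [17]   | 57.9 | 38.3 | 24.5 | 16.0 | 16.7  | 57.3 | 36.9 | 24.0 | 15.7 | 15.3  |
| Google NIC [36]       | 63   | 41   | 27   | —    | —     | 66.3 | 42.3 | 27.7 | 18.3 | —     |
| LRCN-CaffeNet         | —    | —    | —    | —    | —     | 58.7 | 39.1 | 25.1 | 16.5 | —     |
| m-RNN-AlexNet [26]    | —    | —    | —    | —    | —     | 54   | 36   | 23   | 15   | —     |
| m-RNN [26]            | —    | —    | —    | —    | —     | 60   | 41   | 28   | 19   | —     |
| Soft-Attention [37]   | 67   | 44.8 | 29.9 | 19.5 | 18.93 | 66.7 | 43.4 | 28.8 | 19.1 | 18.49 |
| Hard-Attention [37]   | 67   | 45.7 | 31.4 | 21.3 | 20.3  | 66.9 | 43.9 | 29.6 | 19.9 | 18.46 |
| emb-gLSTM, Polynomial | 61.0 | 43.0 | 29.6 | 20.1 | 20.60 | 59.8 | 41.3 | 29.3 | 19.2 | 18.58 |
| emb-gLSTM, Min-hinge  | 64.3 | 45.7 | 31.6 | 21.5 | 20.28 | 63.8 | 44.1 | 30.2 | 20.5 | 18.13 |
| emb-gLSTM, Gaussian   | 64.7 | 45.9 | 31.8 | 21.6 | 20.19 | 64.6 | 44.6 | 30.5 | 20.6 | 17.91 |

Table 4: Comparison with state-of-the-art methods on Flickr8k and Flickr30k.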


gLSTM, the gLSTM with semantic embedding guidance.
We also observe that img-gLSTM, the gLSTM with the image
itself as guidance, does not bring any improvement and
even degrades performance. In addition, to verify that the
improvement comes mainly from the global guidance rather
than from extra parameters, we train a baseline with more
parameters (512 dimensions instead of 256 for each gate).
This larger baseline has 5.2M network parameters, compared
to 5.9M and 3.1M for the proposed ret-gLSTM and emb-gLSTM.
As shown in Table 3, increasing the number of parameters
does improve performance, but the larger baseline is still
slightly worse than the proposed emb-gLSTM, even
though emb-gLSTM has far fewer parameters.
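
To make the comparison above concrete, the short sketch below computes the difference in B@4 and METEOR between models, using only the Gaussian-normalization rows copied from Table 3. The dictionary layout and the helper name `gain` are illustrative choices, not part of the original evaluation code.

```python
# B@4 and METEOR with Gaussian length normalization, copied from Table 3.
table3_gaussian = {
    "Baseline":     {"B@4": 18.6, "METEOR": 18.35},
    "Baseline 512": {"B@4": 19.5, "METEOR": 19.07},
    "ret-gLSTM":    {"B@4": 20.6, "METEOR": 19.38},
    "emb-gLSTM":    {"B@4": 21.6, "METEOR": 20.19},
    "img-gLSTM":    {"B@4": 17.3, "METEOR": 17.69},
}

def gain(model, reference, metric):
    """Absolute difference (in points) between two rows for one metric."""
    diff = table3_gaussian[model][metric] - table3_gaussian[reference][metric]
    return round(diff, 2)

for model in ("ret-gLSTM", "emb-gLSTM", "img-gLSTM"):
    for reference in ("Baseline", "Baseline 512"):
        print(
            f"{model} vs {reference}: "
            f"B@4 {gain(model, reference, 'B@4'):+.2f}, "
            f"METEOR {gain(model, reference, 'METEOR'):+.2f}"
        )
```

Read this way, emb-gLSTM stays ahead of the wider 512-dimensional baseline on both metrics, while img-gLSTM falls below even the smaller baseline.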

5.3. Comparison with State-of-the-art methods 

We compare the proposed gLSTM with state-of-the-art
methods for caption generation in the literature. We perform
the experiment on Flickr8k and Flickr30k and report
the results in Table 4. We only evaluate emb-gLSTM in
this experiment, since it is computationally efficient and
obtains the best performance among the different models in the
previous experiment. Most of the evaluated methods
use a CNN with a deeper network architecture, such as
OxfordNet and GoogLeNet [33]. Methods which do not use
a deeper CNN include LRCN-CaffeNet and
m-RNN-AlexNet [26]. Note that Google's method [36] uses an
ensemble of multiple LSTM models, while ours only uses a
single emb-gLSTM model. As the table shows, the
proposed emb-gLSTM model performs favorably against
state-of-the-art approaches. Interestingly, it performs
on par with the latest state of the art [37], which is based
on more complicated and expensive attention mechanisms.

6. Conclusion 

In this work we have proposed an extension of the LSTM
model for image caption generation. By adding semantic
information as extra input to each unit of the LSTM
block, we have shown that the model can better stay “on



| Method                | B@1  | B@2  | B@3  | B@4  | METEOR | CIDEr |
|-----------------------|------|------|------|------|--------|-------|
| multimodal RNN [17]   | 62.5 | 45.0 | 32.1 | 23.0 | 19.5   | 66    |
| Google NIC [36]       | 66.6 | 46.1 | 32.9 | 24.6 | —      | —     |
| LRCN-CaffeNet         | 62.8 | 44.2 | 30.4 | —    | —      | —     |
| m-RNN [26]            | 67   | 49   | 35   | 25   | —      | —     |
| Soft-Attention [37]   | 70.7 | 49.2 | 34.4 | 24.3 | 23.9   | —     |
| Hard-Attention [37]   | 71.8 | 50.4 | 35.7 | 25.0 | 23.04  | —     |
| emb-gLSTM, Polynomial | 63.8 | 46.3 | 33.6 | 24.8 | 23.33  | 79.03 |
| emb-gLSTM, Min-hinge  | 66.3 | 48.5 | 35.4 | 26.2 | 22.95  | 81.26 |
| emb-gLSTM, Gaussian   | 67.0 | 49.1 | 35.8 | 26.4 | 22.74  | 81.25 |

Table 5: Comparison with state-of-the-art methods on MS
COCO.

track”, describing the image content without drifting away
to unrelated yet common phrases. In addition, we have explored
different types of length normalization for beam search
in order to prevent a bias towards very short sentences,
which further improves the results. The proposed method
achieves state-of-the-art performance on various benchmark
datasets. Moreover, our key contributions are, to a large
extent, complementary to key aspects of other methods, such
as attention mechanisms [37] or model ensembles [36],
indicating that further improvements in performance may be
obtained by integrating these schemes.
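
The idea of feeding semantic information as extra input to each unit of the LSTM block can be sketched as follows. This is a minimal pure-Python sketch assuming a standard LSTM formulation in which a fixed guidance vector `g` gets its own weight matrix in every gate; the exact gate equations, the `tanh` on the output, and all names (`guided_lstm_step`, `random_params`, the toy dimensions) are assumptions made for illustration rather than the implementation used in the experiments.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def guided_lstm_step(x, h_prev, c_prev, g, params):
    """One LSTM step in which the guidance vector g feeds every gate."""
    pre = {}
    for name in ("i", "f", "o", "c"):
        W_x, W_h, W_g, b = params[name]
        pre[name] = [
            ax + ah + ag + bias
            for ax, ah, ag, bias in zip(
                affine(W_x, x), affine(W_h, h_prev), affine(W_g, g), b
            )
        ]
    i = [sigmoid(z) for z in pre["i"]]          # input gate
    f = [sigmoid(z) for z in pre["f"]]          # forget gate
    o = [sigmoid(z) for z in pre["o"]]          # output gate
    c_tilde = [math.tanh(z) for z in pre["c"]]  # candidate cell state
    c = [fk * ck + ik * tk for fk, ck, ik, tk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def random_params(x_dim, h_dim, g_dim, seed=0):
    """Small random weights for the four gates (hypothetical initialization)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        name: (mat(h_dim, x_dim), mat(h_dim, h_dim), mat(h_dim, g_dim), [0.0] * h_dim)
        for name in ("i", "f", "o", "c")
    }

params = random_params(x_dim=3, h_dim=4, g_dim=2)
x = [0.5, -0.2, 0.1]
h0, c0 = [0.0] * 4, [0.0] * 4

# Same word input, two different guidance vectors.
h_a, c_a = guided_lstm_step(x, h0, c0, [1.0, 0.0], params)
h_b, c_b = guided_lstm_step(x, h0, c0, [0.0, 1.0], params)
print(h_a)
print(h_b)
```

In such a setup the same guidance vector would be passed at every time step, so the gates keep receiving image-related semantic information throughout the sentence instead of only at the start.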
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